This notebook explores and models the data collected from recordings of the natural acoustic environment over the urban-rural gradient near Innsbruck, Austria. The models are implemented as Bayesian models with the PyMC3 probabilistic programming library.
References:
https://github.com/fonnesbeck/multilevel_modeling
Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models (1st ed.). Cambridge University Press.
In [1]:
import warnings
warnings.filterwarnings('ignore')
In [2]:
import pandas
import numpy
In [3]:
%matplotlib inline
from matplotlib import pyplot
from matplotlib.patches import Rectangle
import seaborn
import mpld3
from mpld3 import plugins
In [4]:
from pymc3 import glm, Model, NUTS, sample, \
forestplot, traceplot, plot_posterior, summary, \
Normal, Uniform, Deterministic, StudentT, stats
Set the chart style to the seaborn defaults.
In [5]:
seaborn.set()
In [6]:
data_filepath = "/Users/Jake/OneDrive/Documents/alpine soundscapes/data/dataset.csv"
In [7]:
seaborn_blue = seaborn.color_palette()[0]
In [8]:
data = pandas.read_csv(data_filepath)
data = data.loc[data.site<=30]
In [9]:
data[['site', 'site_name', 'building_200m', 'forest_200m', 'd2n_200m']].head()
Out[9]:
Sort the data by site and then by visit.
In [10]:
data = data.sort_values(by=['site', 'sound']).reset_index(drop=True)
In [11]:
data[['site', 'site_name', 'biophony',
'forest_200m', 'd2n_200m', 'temperature', 'week']].head(n=10)
Out[11]:
Here is a pairwise plot of a subset of the data. The 'week' variable represents the number of weeks from the start of field collection (January 27, 2016) to when each recording was made. The 'd2n_200m' variable represents the average Distance to Nature (d2n) value, a landscape index representing the amount of human alteration, within a 200-meter radius of each recording location. The 'temperature' variable represents the temperature in degrees Celsius, averaged from the measurements taken at the University of Innsbruck weather station 30 minutes before and after each recording.
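As an aside, here is a minimal sketch of how such a windowed average could be computed with pandas; the raw weather-station table, its column names ('time', 'temperature'), and the recording start times are hypothetical stand-ins for illustration only, not the project's actual data files.
import pandas
# hypothetical 10-minute weather-station temperatures (degrees Celsius)
station = pandas.DataFrame({
    'time': pandas.date_range('2016-01-27 06:00', periods=12, freq='10min'),
    'temperature': [1.2, 1.4, 1.5, 1.9, 2.3, 2.6, 2.8, 3.1, 3.0, 3.2, 3.5, 3.7],
}).set_index('time')
# hypothetical recording start times
recordings = pandas.DataFrame({
    'site': [1, 2],
    'start': pandas.to_datetime(['2016-01-27 06:40', '2016-01-27 07:20']),
})
# average all station measurements within 30 minutes of each recording start
window = pandas.Timedelta(minutes=30)
recordings['temperature'] = [
    station.loc[t - window:t + window, 'temperature'].mean()
    for t in recordings.start
]
recordings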
In [12]:
pairplot = seaborn.pairplot(data[['biophony', 'sel',
'week', 'temperature', 'd2n_500m']])
There is definitely a relationship between biophony and land cover, although a more robust analysis might explain the relationship better than the simple regression shown below.
In [13]:
biophony_jointplot = seaborn.jointplot('d2n_100m', 'biophony', data=data, kind='reg',
xlim=(-0.1, 0.6), ylim=(0, 15))
Land cover seems to have a greater effect on anthrophony.
In [14]:
anthrophony_jointplot = seaborn.jointplot('d2n_100m', 'sel', data=data, kind='reg',
xlim=(-0.1, 0.6), ylim=(-70, -10))
As the boxplots show, repeated measurements of percent biophony and sel vary little at some sites and greatly at others.
In [15]:
biophony_boxplot = seaborn.boxplot('site', 'biophony', data=data)
biophony_boxplot.figure.set_figwidth(10)
In [16]:
anthrophony_boxplot = seaborn.boxplot('site', 'sel', data=data)
anthrophony_boxplot.figure.set_figwidth(10)
We can look at a standard regression analysis to compare it with more robust multilevel models. We start with a "complete pooling" analysis, considering only the 'week' variable (effectively time of year) as a predictor for biophony.
We start with the model $y_i = \alpha + \beta x_i + \epsilon_i$, where $y_i$ is the biophony at recording $i$, $\alpha$ is the intercept, $\beta$ is the slope, and $x_i$ is the number of weeks from the start of field collection for recording $i$.
We can define the model in PyMC3 and run it with 1000 samples.
In [17]:
with Model() as pooled_model:
glm.GLM.from_formula("biophony ~ week", data)
pooled_trace = sample(1000, NUTS(), random_seed=1)
In [18]:
p = plot_posterior(pooled_trace, varnames=['Intercept', 'week'])
# TODO: fix font style
Here are the mean parameter values (avoiding the burn-in period, or the first 500 samples).
In [19]:
intercept0 = pooled_trace['Intercept'][500:].mean()
slope0 = pooled_trace['week'][500:].mean()
In [20]:
intercept0
Out[20]:
In [21]:
slope0
Out[21]:
Here is a plot of the data and the fitted regression line.
In [22]:
fig, ax = pyplot.subplots(figsize=(15, 5))
ax.scatter(data.week, data.biophony, marker='.', color='black', alpha=0.5)
x_week = numpy.linspace(0, 100)
# plot 50 samples from the posterior of the parameters
for d in numpy.random.random_integers(500, 999, 50):
p1 = ax.plot(x_week, pooled_trace['Intercept'][d] + pooled_trace['week'][d] * x_week,
color=seaborn_blue, alpha=0.1)
# plot mean parameter values
p1 = ax.plot(x_week, intercept0 + slope0*x_week, 'k--')
l1 = ax.set_xlabel("weeks since project start")
l2 = ax.set_ylabel("biophony")
f1 = ax.set_xlim([0, 14])
f1 = ax.set_ylim([0, 15])
The complete pooling model above does not allow the regression parameters to vary by site, which is likely unrealistic. For example, it seems probable that the change of biophony over time could vary between locations. We can see what happens if we fit a separate regression to each site (a different intercept (alpha) and slope (beta) for each site j).
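Written out, this no-pooling approach fits a separate regression to the recordings at each site $j$, with each $\alpha_j$ and $\beta_j$ estimated from that site's data alone:
$y_i = \alpha_j + \beta_j x_i + \epsilon_i$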
In [23]:
sample_sites = (2, 9, 11, 15, 17, 18, 26, 30)
In [24]:
fig, ax = pyplot.subplots(2, 4, figsize=(12, 6), sharey=True, sharex=True)
ax = ax.ravel()
for i, site in enumerate(sample_sites):
# plot data and regression line
y = data.biophony[data.site==site]
x = data.week[data.site==site]
seaborn.regplot(x, y, ax=ax[i], ci=None, color='grey')
# plot complete pooling regression line and set format
x_week = numpy.linspace(0, 100)
ax[i].plot(x_week, (slope0 * x_week) + intercept0, 'r--')
ax[i].set_title("{0} - {1}".format(site, data.site_name[data.site == site].unique()[0]))
ax[i].set_ylim([0, 15])
We can see from the no-pooling example that biophony does seem to change differently at each location. However, as shown below, a multilevel model allows us to apply a single model to the data set while allowing the regression parameters to vary between sites. For sites with fewer measurements, the regression parameters are pulled towards the parameters of the entire data set (towards complete pooling), while for sites with many measurements, the regression parameters are pulled less towards the entire data set (towards no pooling).
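Gelman and Hill (2006) give an approximate closed form for this compromise: the multilevel estimate of a site mean is a precision-weighted average of that site's sample mean and the grand mean, so the weight on a site's own data grows with its number of measurements $n_j$:
$\hat{\alpha}_j \approx \dfrac{\frac{n_j}{\sigma_y^2}\,\bar{y}_j + \frac{1}{\sigma_\alpha^2}\,\bar{y}_{\text{all}}}{\frac{n_j}{\sigma_y^2} + \frac{1}{\sigma_\alpha^2}}$
where $\sigma_y^2$ is the within-site variance and $\sigma_\alpha^2$ is the between-site variance.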
The model shown below contains no predictors (referred to as the empty model), and thus shows only how the mean biophony varies between locations:
$y_{ij} = \beta_j + r_{ij}$, with $\beta_j = \gamma + u_j$
In this equation, $y_{ij}$ represents the biophony for recording $i$ at location $j$, $\beta_j$ represents the mean biophony of location $j$, $\gamma$ represents the grand mean of biophony (the mean of the location means), $u_j$ represents the difference between the mean biophony at location $j$ and the grand mean biophony, and $r_{ij}$ represents the difference between the biophony of recording $i$ at location $j$ and the mean biophony at location $j$. The model can also be thought of as defining a separate intercept for each site with no slope.
We can define this model in PyMC3 as shown below.
NOTE: While the model equations are shown in frequentist form, PyMC3 implements the models in a Bayesian context.
In [25]:
sites = numpy.copy(data.site.values) - 1
In [26]:
with Model() as partial_pooling:
# Priors
mu_a = Normal('mu_a', mu=0., tau=0.0001)
sigma_a = Uniform('sigma_a', lower=0, upper=100)
tau_a = sigma_a**-2
# Random intercepts
a = Normal('a', mu=mu_a, tau=tau_a, shape=len(set(sites)))
# Model error
sigma_y = Uniform('sigma_y', lower=0, upper=100)
tau_y = sigma_y**-2
# Expected value
y_hat = a[sites]
# Data likelihood
y_like = Normal('y_like', mu=y_hat, tau=tau_y, observed=data.biophony)
In [27]:
with partial_pooling:
step = NUTS()
partial_pooling_samples = sample(2000, step, random_seed=1)
We can plot both the unpooled means and the partially-pooled means.
In [28]:
sample_trace = partial_pooling_samples['a'][-1000:]
fig, ax = pyplot.subplots(1, 2, figsize=(14,6), sharex=True, sharey=True)
samples, sites = sample_trace.shape
jitter = numpy.random.normal(scale=0.1, size=sites)
n_site = data.groupby('site')['biophony'].count()
unpooled_means = data.groupby('site')['biophony'].mean()
unpooled_sd = data.groupby('site')['biophony'].std()
unpooled = pandas.DataFrame({'n':n_site, 'm':unpooled_means, 'sd':unpooled_sd})
unpooled['se'] = unpooled.sd / numpy.sqrt(unpooled.n)
# plot unpooled means
ax[0].scatter(unpooled.n + jitter, unpooled.m, marker='o', color='black')
for j, row in zip(jitter, unpooled.iterrows()):
name, dat = row
ax[0].plot([dat.n+j,dat.n+j], [dat.m-dat.se, dat.m+dat.se],
linestyle='-', color='black', alpha=0.25)
t1 = ax[0].hlines(sample_trace.mean(), 0.9, 100, linestyles='--')
ax[0].set_xticks([3, 4, 5])
t2 = ax[0].set_title('unpooled means')
t3 = ax[0].set_xlabel('number of measurements (jittered)')
t4 = ax[0].set_ylabel('biophony')
# plot partially-pooled means
samples, sites = sample_trace.shape
means = sample_trace.mean(axis=0)
sd = sample_trace.std(axis=0)
ax[1].scatter(n_site.values + jitter, means, marker='o', color='purple')
ax[1].set_xlim(2, 6)
ax[1].hlines(sample_trace.mean(), 0.9, 100, linestyles='--')
for j,n,m,s in zip(jitter, n_site.values, means, sd):
ax[1].plot([n+j]*2, [m-s, m+s], linestyle='-', color='purple', alpha=0.25)
t1 = ax[1].set_title('partially-pooled means')
t2 = ax[1].set_xlabel('number of measurements (jittered)')
t3 = ax[1].set_ylabel('biophony')
Note that there is less pooling for sites with a higher number of measurements.
Going forward with the multilevel model and partial pooling, we can begin to add predictors. This model adds the 'week' variable as a predictor but does not allow its effect (slope) to vary between locations; the intercepts are still allowed to vary between locations, though without a group-level predictor.
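In the notation used above, this is a varying-intercept model with a common slope, matching the 'a', 'b', 'mu_a', and 'sigma_a' variables in the PyMC3 code below:
$y_{ij} = \alpha_j + \beta x_{ij} + \epsilon_{ij}, \qquad \alpha_j \sim N(\mu_\alpha, \sigma_\alpha^2)$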
In [29]:
sites = numpy.copy(data.site.values) - 1
In [30]:
with Model() as varying_intercept:
# Priors
mu_a = Normal('mu_a', mu=0., tau=0.0001)
sigma_a = Uniform('sigma_a', lower=0, upper=100)
tau_a = sigma_a**-2
# Random intercepts
a = Normal('a', mu=mu_a, tau=tau_a, shape=len(set(sites)))
# Common slope
b = Normal('b', mu=0., tau=0.0001)
# Model error
sigma_y = Uniform('sigma_y', lower=0, upper=100)
tau_y = sigma_y**-2
# Expected value
y_hat = a[sites] + b * data.week
# Data likelihood
y_like = Normal('y_like', mu=y_hat, tau=tau_y, observed=data.biophony)
In [31]:
with varying_intercept:
step = NUTS()
varying_intercept_samples = sample(2000, step, random_seed=1)
In [32]:
site_labels = ["{1} - {0}".format(s,
data.site_name[data.site == s].unique()[0])\
for s in range(1, 31)]
pyplot.figure(figsize=(5, 8))
p1 = forestplot(varying_intercept_samples, varnames=['a'],
ylabels=site_labels,
xtitle="biophony (intercept)")
We can also plot the sample traces and distributions for other variables defined in the model. Here 'sigma_a' represents the standard deviation of the location-level mean biophony values around the grand mean, and 'b' represents the common slope for all locations.
In [33]:
p1 = traceplot(varying_intercept_samples[-1000:], varnames=['sigma_a', 'b'])
We can also print a summary with credible intervals for each model variable. The summary below shows that the estimated slope ('b') is 0.259, meaning that for each one-week increase the biophony value increases by about 0.259 (for example, roughly 2.6 over ten weeks).
In [34]:
summary(varying_intercept_samples[-1000:], varnames=['b'])
Now we can plot a regression line for each location.
In [35]:
fig, ax = pyplot.subplots(figsize=(15, 5))
x_week = numpy.arange(0, 110, 10)
bp = varying_intercept_samples['a'].mean(axis=0)
mp = varying_intercept_samples['b'].mean()
ax.scatter(data.week, data.biophony, marker='.', color='black', alpha=0.5)
for bi in bp:
ax.plot(x_week, mp * x_week + bi, color='purple', marker=None, alpha=0.1)
l1 = ax.set_xlabel("week since project start")
l2 = ax.set_ylabel("biophony")
f1 = ax.set_xlim([0, 14])
f2 = ax.set_ylim([-5, 15])
Here are separate plots for each location, showing three different regressions.
In [36]:
fig, ax = pyplot.subplots(2, 4, figsize=(12, 6), sharey=True, sharex=True)
ax = ax.ravel()
for i, site in enumerate(sample_sites):
    # Plot site data and unpooled regression
y = data.biophony[data.site==site]
x = data.week[data.site==site]
seaborn.regplot(x, y, ax=ax[i], ci=None, color='grey')
# plot pooled and partially-pooled regressions
x_week = numpy.linspace(0, 14)
# Pooled estimate
ax[i].plot(x_week, slope0 * x_week + intercept0, 'r--')
    # Partial pooling estimate
ax[i].plot(x_week, mp * x_week + bp[site - 1], color='purple', marker=None, linestyle='-')
ax[i].set_title("{0} - {1}".format(site, data.site_name[data.site == site].unique()[0]))
ax[i].set_ylim([-5, 15])
We can also see what happens if we fix a common intercept, but allow the slope to vary between locations.
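This corresponds to a common-intercept, varying-slope model (in the code below, the intercept and the site slopes are given uniform priors rather than the normal priors used so far):
$y_{ij} = \alpha + \beta_j x_{ij} + \epsilon_{ij}$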
In [37]:
sites = numpy.copy(data.site.values) - 1
In [38]:
with Model() as varying_slope:
# Priors
# mu_b = Normal('mu_b', mu=0., tau=0.0001)
sigma_b = Uniform('sigma_b', lower=0, upper=100)
tau_b = sigma_b**-2
# Model intercept
# a = Normal('a', mu=0., tau=0.0001)
a = Uniform('a', lower=-10, upper=10)
# Random slopes
# b = Normal('b', mu=mu_b, tau=tau_b, shape=len(set(sites)))
b = Uniform('b', lower=-10, upper=10, shape=len(set(sites)))
# Model error
sigma_y = Uniform('sigma_y', lower=0, upper=100)
tau_y = sigma_y**-2
# Expected value
y_hat = a + b[sites] * data.week
# Data likelihood
y_like = Normal('y_like', mu=y_hat, tau=tau_y, observed=data.biophony)
In [39]:
with varying_slope:
step = NUTS()
varying_slope_samples = sample(2000, step, random_seed=1)
Here is a plot of the estimated slope for each location.
In [40]:
site_labels = ["{1} - {0}".format(s,
data.site_name[data.site == s].unique()[0])\
for s in range(1, 31)]
pyplot.figure(figsize=(5, 8))
p1 = forestplot(varying_slope_samples, varnames=['b'],
ylabels=site_labels,
xtitle="slope")
In [42]:
vs_trace = varying_slope_samples[:1500]
traceplot(vs_trace, varnames=['b'])
Out[42]:
This result is interesting, as locations with an estimated slope near 0 (e.g., Wilten and Templstraße) are generally located in areas with mostly urban land cover, and other locations with higher estimated slopes (e.g., Pumhof and Wittenberg) are generally located in areas with a large amount of forest land cover.
Here is a plot of the regression lines for all the locations.
In [43]:
fig, ax = pyplot.subplots(figsize=(15, 5))
# all data
p1 = ax.scatter(data.week, data.biophony, marker='.', color='black', alpha=0.5)
# varying slopes
x_week = numpy.arange(0, 16, 1)
b = varying_slope_samples['a'].mean()
m = varying_slope_samples['b'].mean(axis=0)
for mi in m:
pi = ax.plot(x_week, mi * x_week + b, marker=None, color=seaborn_blue, alpha=0.1)
# pooled
p2 = ax.plot(x_week, intercept0 + slope0 * x_week, marker=None, color='black', linestyle='--')
# formatting
l1 = ax.set_xlabel("week since project start")
l2 = ax.set_ylabel("biophony")
f1 = ax.set_xlim([0, 15])
Now we can try varying both slope and intercept.
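That is, both the intercept and the slope get a site index and a shared normal prior:
$y_{ij} = \alpha_j + \beta_j x_{ij} + \epsilon_{ij}, \qquad \alpha_j \sim N(\mu_\alpha, \sigma_\alpha^2), \quad \beta_j \sim N(\mu_\beta, \sigma_\beta^2)$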
In [44]:
sites = numpy.copy(data.site.values) - 1
In [45]:
with Model() as varying_intercept_slope:
# Priors
mu_a = Normal('mu_a', mu=0., tau=0.0001)
sigma_a = Uniform('sigma_a', lower=0, upper=100)
tau_a = sigma_a**-2
mu_b = Normal('mu_b', mu=0., tau=0.0001)
sigma_b = Uniform('sigma_b', lower=0, upper=100)
tau_b = sigma_b**-2
# Random intercepts
a = Normal('a', mu=mu_a, tau=tau_a, shape=len(set(sites)))
# Random slopes
b = Normal('b', mu=mu_b, tau=tau_b, shape=len(set(sites)))
# Model error
sigma_y = Uniform('sigma_y', lower=0, upper=100)
tau_y = sigma_y**-2
# Expected value
y_hat = a[sites] + b[sites] * data.week
# Data likelihood
y_like = Normal('y_like', mu=y_hat, tau=tau_y, observed=data.biophony)
y_sim = Normal('y_sim', mu=y_hat, tau=tau_y, shape=y_hat.tag.test_value.shape)
In [46]:
with varying_intercept_slope:
step = NUTS()
varying_intercept_slope_samples = sample(2000, step, random_seed=1)
In [47]:
fig, ax = pyplot.subplots(figsize=(15, 5))
# all data
p1 = ax.scatter(data.week, data.biophony, marker='.', color='black', alpha=0.5)
# varying slopes and intercepts
x_week = numpy.arange(0, 16, 1)
b = varying_intercept_slope_samples['a'].mean(axis=0)
m = varying_intercept_slope_samples['b'].mean(axis=0)
for bi,mi in zip(b,m):
pi = ax.plot(x_week, mi * x_week + bi, marker=None, color=seaborn_blue, alpha=0.1)
# pooled
p2 = ax.plot(x_week, intercept0 + slope0 * x_week, marker=None, color='black', linestyle='--')
# formatting
l1 = ax.set_xlabel("week since project start")
l2 = ax.set_ylabel("biophony")
f1 = ax.set_xlim([0, 15])
Here are plots for individual locations again, showing how this model (blue sample lines) compares with the unpooled regression for each site (black).
In [48]:
fig, ax = pyplot.subplots(2, 4, figsize=(12, 6), sharey=True, sharex=True)
ax = ax.ravel()
for i, site in enumerate(sample_sites):
# Plot pooled and partially-pooled regressions
x_week = numpy.linspace(0, 16)
# Pooled estimate
ax[i].plot(x_week, slope0 * x_week + intercept0, 'r-', alpha=0.5)
    # Partial pooling estimate
ax[i].plot(x_week, mp * x_week + bp[site - 1], color='purple', marker=None, linestyle='-', alpha=0.5)
# Varying slopes and intercepts
for d in numpy.random.random_integers(500, 1999, 100):
ax[i].plot(x_week,
(varying_intercept_slope_samples['b'][d][site - 1] * x_week) + varying_intercept_slope_samples['a'][d][site - 1],
color=seaborn_blue, marker=None, alpha=0.1)
# Plot site data and unpooled regression
y = data.biophony[data.site==site]
x = data.week[data.site==site]
seaborn.regplot(x, y, ax=ax[i], ci=None, color='black', line_kws={'linestyle': '--', 'linewidth': 0.5})
ax[i].plot(x_week, m[site - 1] * x_week + b[site - 1], color='black', marker=None, linestyle='--')
ax[i].set_title("{0} - {1}".format(site, data.site_name[data.site == site].unique()[0]))
ax[i].set_xlim([0, 15])
ax[i].set_ylim([-2, 16])
In [49]:
trace = varying_intercept_slope_samples[-1500:]
a_means = trace['a'].mean(axis=0)
b_means = trace['b'].mean(axis=0)
Now we can see whether we can assign a predictor to the level 2 (group, or location-level) parameters. Land cover is likely to be a significant predictor of the amount of biophony (intercept) and of how it changes over time (slope).
There are many predictor variables in the data set, but we can define an aggregate predictor from those likely to have the largest effect. We define the 'forest_200m' variable as the percentage of forest land cover minus the combined percentage of building and paved-surface land cover within a 200-meter radius of each recording location; a sketch of this calculation follows the table below. We can later check the fit of the model with other predictor variables.
In [50]:
data[['site', 'd2n_200m', 'forest_200m']].groupby('site').mean().head()
Out[50]:
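Here is a minimal sketch of how such an aggregate land cover variable could be derived; the per-class percentage columns ('forest_pct_200m', 'building_pct_200m', 'pavement_pct_200m') and their values are hypothetical names used only for illustration, not columns of dataset.csv.
# hypothetical per-class land cover percentages within a 200-meter radius
land_cover = pandas.DataFrame({
    'site': [1, 2, 3],
    'forest_pct_200m': [80.0, 35.0, 5.0],
    'building_pct_200m': [2.0, 20.0, 45.0],
    'pavement_pct_200m': [3.0, 15.0, 40.0],
})
# net forest: forest percentage minus built (building + paved) percentage
land_cover['forest_200m'] = (land_cover.forest_pct_200m
                             - (land_cover.building_pct_200m
                                + land_cover.pavement_pct_200m))
land_cover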
Here is the model, this time defined in Bayesian form:
$y_{ij} \sim N(\alpha_j + \beta_j x_{ij},\ \sigma_y^2)$
$\alpha_j \sim N(\gamma_{00} + \gamma_{01} u_j,\ \sigma_\alpha^2)$
$\beta_j \sim N(\gamma_{10} + \gamma_{11} u_j,\ \sigma_\beta^2)$
where $x_{ij}$ is the measurement-level week indicator and $u_j$ is the location-level land cover indicator.
We transform and center some of the variables to make it easier to interpret the model results.
In [51]:
data['biophony_centered'] = data.biophony - data.biophony.mean()
data['weeks_centered'] = data.week - data.week.mean()
data['forest_200m_centered'] = data.forest_200m - data.forest_200m.mean()
data['d2n_200m_centered'] = data.d2n_200m - data.d2n_200m.mean()
In [52]:
sites = numpy.copy(data.site.values) - 1
In [53]:
with Model() as model_3:
# intercept
g_00 = Normal('g_00', mu=0, tau=0.001)
g_01 = Normal('g_01', mu=0, tau=0.001)
sigma_a = Uniform('sigma_a', lower=0, upper=100)
tau_a = sigma_a**-2
mu_a = g_00 + (g_01 * data.forest_200m_centered.unique())
a = Normal('a', mu=mu_a, tau=tau_a, shape=len(set(sites)))
# slope
g_10 = Normal('g_10', mu=0, tau=0.001)
g_11 = Normal('g_11', mu=0, tau=0.001)
sigma_b = Uniform('sigma_b', lower=0, upper=100)
tau_b = sigma_b**-2
mu_b = g_10 + (g_11 * data.forest_200m_centered.unique())
b = Normal('b', mu=mu_b, tau=tau_b, shape=len(set(sites)))
# model error (data-level)
sigma_y = Uniform('sigma_y', lower=0, upper=100)
tau_y = sigma_y**-2
# expected values
y_hat = a[sites] + (b[sites] * data.weeks_centered)
# likelihood
y_like = Normal('y_like', mu=y_hat, tau=tau_y, observed=data.biophony_centered)
# simulated
#y_sim = Normal('y_sim', mu=y_hat, tau=tau_y, shape=y_hat.tag.test_value.shape)
In [54]:
with model_3:
step = NUTS()
model_3_samples = sample(10000, step, random_seed=1)
We can compute the estimated intercept and slope for each location based on the mean of the last 5000 samples from the model trace.
In [55]:
trace = model_3_samples[-5000:]
a_means = trace['a'].mean(axis=0)
b_means = trace['b'].mean(axis=0)
Now we can compare how well the estimated level 2 parameters predict the estimated level 1 parameters.
In [56]:
fig, ax = pyplot.subplots(1, 2, sharex=True, figsize=(15, 5))
# plot estimated level 2 parameter (intercept)
intercept_points = ax[0].scatter(data.forest_200m_centered.unique(), a_means, color='black', alpha=0.4)
g00 = trace['g_00'].mean()
g01 = trace['g_01'].mean()
x_forest = numpy.linspace(-70, 70)
p1 = ax[0].plot(x_forest, g00 + (g01 * x_forest), color='black', linestyle='--')
pl = ax[0].set_xlim(-60, 60)
# plot estimated level 1 parameters
a_se = trace['a'].std(axis=0)
for xi, m, se in zip(data.forest_200m_centered.unique(), a_means, a_se):
pi = ax[0].plot([xi, xi], [m-se, m+se], color='black', alpha=0.1)
t1 = ax[0].set_xlabel('net forest within a 200 meter radius')
t2 = ax[0].set_ylabel('intercept estimate (difference from mean biophony)')
t0 = ax[0].set_title('predicting intercept estimates')
# for interactive plot via mpld3
#intercept_point_labels = ["{0}: {1:0.2f}".format(n, s) for n, s in zip(data.site_name.unique(), a_means)]
#tooltip = plugins.PointLabelTooltip(intercept_points, intercept_point_labels)
#plugins.connect(fig, tooltip)
#mpld3.display(fig)
# plot estimated level 2 parameter (slope)
slope_points = ax[1].scatter(data.forest_200m_centered.unique(), b_means, color='black', alpha=0.4)
g10 = trace['g_10'].mean()
g11 = trace['g_11'].mean()
x_forest = numpy.linspace(-70, 70)
p2 = ax[1].plot(x_forest, g10 + (g11 * x_forest), color='black', linestyle='--')
pl = ax[1].set_xlim(-60, 60)
# plot estimated level 1 parameters
b_se = trace['b'].std(axis=0)
for xi, m, se in zip(data.forest_200m_centered.unique(), b_means, b_se):
pi = ax[1].plot([xi, xi], [m-se, m+se], color='black', alpha=0.1)
t3 = ax[1].set_xlabel('net forest within a 200 meter radius')
t4 = ax[1].set_ylabel('slope estimate')
t5 = ax[1].set_title('predicting slope estimates')
# for interactive plot via mpld3
#slope_point_labels = ["{0}: {1:0.2f}".format(n, s) for n, s in zip(data.site_name.unique(), b_means)]
#tooltip = plugins.PointLabelTooltip(slope_points, slope_point_labels)
#plugins.connect(fig, tooltip)
#mpld3.display(fig)
Above, we can see that the land cover variable (forest_200m) predicts reasonably well how much biophony changes at each location over time (right plot). The intercept values do not fit quite as well, although there still seems to be a clear relationship (left plot). The two major outliers may be caused by errors in the source separation algorithm. We can compute an R squared value for each level 2 parameter and for all of level 1 (see below).
First, here are individual plots for all the locations.
In [57]:
fig, ax = pyplot.subplots(6, 5, figsize=(15, 20), sharey=True, sharex=True)
ax = ax.ravel()
for i, site in enumerate(data.site.unique()):
x_week = numpy.linspace(-10, 10, 2)
# draw parameter samples
for d in numpy.random.random_integers(0, 4999, 100):
ax[i].plot(x_week,
(trace['b'][d][site - 1] * x_week) + trace['a'][d][site - 1],
color=seaborn_blue, marker=None, alpha=0.1)
# observed data (points)
y = data.biophony_centered[data.site==site]
x = data.weeks_centered[data.site==site]
ax[i].scatter(x, y, color='black')
# model estimate (line)
    x_week = numpy.linspace(-10, 10, 2)
ax[i].plot(x_week, a_means[i] + (b_means[i] * x_week), color='black', linestyle='--')
ax[i].set_title("{0} - {1}".format(site, data.site_name[data.site == site].unique()[0]))
ax[i].text(-6, 12, "slope: {0:.02f}".format(b_means[i]), va='bottom')
ax[i].set_xlim([-7, 7])
ax[i].set_ylim([-10, 15])
As defined in Gelman and Hill 2006, we can compute (effective) R squared values for the multilevel model.
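As implemented in the next cell, each value is one minus the ratio of residual variance to total variance at that level:
$R^2 \approx 1 - \dfrac{\mathrm{Var}(e)}{\mathrm{Var}(\theta)}$
where $\theta$ is the quantity being modeled at that level (the centered biophony values at level 1; the intercept or slope estimates at level 2) and $e$ is the corresponding vector of prediction errors.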
In [58]:
# level 1
a_means_expanded = numpy.array([a_means[v-1] for v in data.site])
b_means_expanded = numpy.array([b_means[v-1] for v in data.site])
y_hat = a_means_expanded + (b_means_expanded * data.weeks_centered)
e_y = data.biophony_centered - y_hat
r2_level1 = 1 - (e_y.var() / data['biophony_centered'].var())
# level 2
a_hat = g00 + (g01 * data.groupby('site')['forest_200m'].mean())
b_hat = g10 + (g11 * data.groupby('site')['forest_200m'].mean())
e_a = a_means - a_hat
e_b = b_means - b_hat
# R squared for intercept estimates
r2_level2a = 1 - (e_a.var() / a_means.var())
# R squared for slope estimates
r2_level2b = 1 - (e_b.var() / b_means.var())
In [59]:
print("R squared values:\n\
level 1\n\
----------\n\
overall: {0:0.2f}\n\
\n\
level 2\n\
----------\n\
intercepts: {1:0.2f}\n\
slopes: {2:0.2f}".format(r2_level1, r2_level2a, r2_level2b))
In [60]:
trace.varnames
Out[60]:
In [61]:
p1 = forestplot(trace, varnames=['sigma_a', 'sigma_b', 'sigma_y'])
In [62]:
data['temperature_centered'] = data.temperature - data.temperature.mean()
In [63]:
data['wind_speed_centered'] = data.wind_speed - data.wind_speed.mean()
In [64]:
sites = numpy.copy(data.site.values) - 1
In [68]:
with Model() as model_4:
# intercept
g_00 = Normal('g_00', mu=0, tau=0.001)
g_01 = Normal('g_01', mu=0, tau=0.001)
sigma_a = Uniform('sigma_a', lower=0, upper=100)
tau_a = sigma_a**-2
mu_a = g_00 + (g_01 * data.forest_200m_centered.unique())
a = Normal('a', mu=mu_a, tau=tau_a, shape=len(set(sites)))
# slope (land cover)
g_10 = Normal('g_10', mu=0, tau=0.001)
g_11 = Normal('g_11', mu=0, tau=0.001)
sigma_b = Uniform('sigma_b', lower=0, upper=100)
tau_b = sigma_b**-2
mu_b = g_10 + (g_11 * data.forest_200m_centered.unique())
b = Normal('b', mu=mu_b, tau=tau_b, shape=len(set(sites)))
# temperature
c = Uniform('c', lower=-10, upper=10, shape=len(set(sites)))
# model error (data-level)
sigma_y = Uniform('sigma_y', lower=0, upper=100)
tau_y = sigma_y**-2
# expected values
    y_hat = a[sites] + (b[sites] * data.weeks_centered) + (c[sites] * data.temperature_centered)
# likelihood
y_like = Normal('y_like', mu=y_hat, tau=tau_y, observed=data.biophony_centered)
# simulated
#y_sim = Normal('y_sim', mu=y_hat, tau=tau_y, shape=y_hat.tag.test_value.shape)